Hypothesis testing
The frequentist framework for hypothesis testing is rooted in Ronald Fisher’s work in the early 20th century. Fisher’s approach is based on the idea of null hypothesis significance testing. The basic idea is to test the null hypothesis that there is no effect against the alternative hypothesis that there is an effect.
A lady of Fisher’s acquaintance claimed she could taste a cup of tea and tell whether the milk was poured before or after the tea. Fisher was skeptical of this claim, so he devised a test to determine whether she could indeed taste the difference.
He designed an experiment preparing eight cups of tea, four with milk first and four with milk second. The lady was to taste the tea and determine which 4 cups were prepared with milk first.
Fisher’s null hypothesis was that the lady could not taste the difference; any of the cups on which she guessed correctly were random correct guesses.
The alternative hypothesis was that she could taste the difference - that correct guesses were systematic.
Take a look at this simulation and description of Fisher’s tea experiment.
Terms
The null hypothesis, usually denoted \(H_0\), is the hypothesis that there is no effect. This is what we’re hoping to show to be wrong.
The alternative hypothesis, usually denoted \(H_1\) or \(H_a\), is the hypothesis that there is an effect, usually what we suspect to be true.
The p-value is the probability of observing a test statistic at least as extreme as the one observed, given that the null hypothesis is true. The smaller the p-value, the less plausible the null hypothesis.
The p-value is the probability of observing the data if the null hypothesis is true. It tells us how likely these data are in a world where the null is correct. It is not the probability that the null is false or that the alternative is true.
Hypothesis testing process
1. Generate the null hypothesis and the alternative hypothesis. The two must be mutually exclusive and exhaustive. For Fisher, the null hypothesis was that the lady could not taste the difference; the alternative hypothesis was that she could.

2. Determine the critical value. This is the value the test statistic must reach for us to reject the null hypothesis. We calculate this based on a significance level (e.g. 0.05). For Fisher, the distribution of guesses is binomial with 8 guesses (n=8) and p=.5 (random guessing). The critical value is the smallest number of correct guesses such that the probability of doing at least that well by chance is below 0.05, computed as

    ```r
    critical_value <- qbinom(1 - 0.05, size = 8, prob = 0.5) + 1
    ```

    which equals 7 (`qbinom` returns 6, the largest count still consistent with chance, so the rejection region begins one guess higher). So if we generate a test statistic of 7 or more correct guesses, we reject the null hypothesis.

3. Using an appropriate sample of data, calculate the test statistic. For Fisher, this is the number of correct guesses out of 8 cups of tea in the actual experiment.

4. Determine the probability of observing a test statistic at least as extreme as this value, given the null hypothesis is true. This is the p-value. For Fisher, the p-value is the probability of observing 7 or more correct guesses out of 8 cups of tea, given that the lady was guessing randomly:

    ```r
    p_value <- 1 - pbinom(7 - 1, size = 8, prob = 0.5)
    ```

    which equals 0.03516.

5. Compare the p-value to a test level, \(\alpha\). It’s common to set \(\alpha=.05\). If the p-value is less than \(\alpha\), reject the null hypothesis; if the p-value is greater than \(\alpha\), fail to reject the null hypothesis. For Fisher, the p-value of 0.03516 is less than 0.05, so we reject the null hypothesis.
We reject the null, or we fail to reject the null. We never accept the null, and we never accept the alternative. We only show that the null is implausible because observing our data would be so unlikely if the null were true.
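The calculations in the steps above can also be done outside R; here is a minimal sketch in Python (standard library only) that computes the same critical value and p-value directly from the binomial pmf, assuming the same simplified setup of eight independent guesses.

```python
from math import comb

# Binomial pmf under the null: 8 independent guesses, each correct with p = 0.5.
def binom_pmf(k, n=8, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(c, n=8, p=0.5):
    """P(X >= c): probability of c or more correct guesses."""
    return sum(binom_pmf(k, n, p) for k in range(c, n + 1))

# Critical value: smallest number of correct guesses whose upper-tail
# probability is at or below the 0.05 significance level.
critical_value = min(c for c in range(0, 9) if upper_tail(c) <= 0.05)

# p-value for observing 7 correct guesses (upper tail at the observed value).
p_value = upper_tail(7)

print(critical_value)       # 7
print(round(p_value, 5))    # 0.03516
```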
Non-directional hypothesis: Two-tailed test
A two-tailed test is appropriate to a null hypothesis that \(\beta = 0\). The alternative hypothesis is that \(\beta \neq 0\). We have no expectation about direction, just that \(\beta\) is not zero. This is called a non-directional hypothesis.
You can see the critical region of the t-distribution is two-sided so that it is either greater than or less than the t-values associated with .025 in the tails (2*.025=.05).
Directional hypothesis: One-tailed test
A one-tailed test is appropriate to a null hypothesis that \(\beta \leq 0\) or \(\beta \geq 0\). The alternative hypothesis is that \(\beta > 0\) or \(\beta < 0\). We have an expectation about the direction of the effect. This is called a directional hypothesis.
Fisher’s hypothesis is directional. The null is that the lady’s sensitivity to the tea is 50-50 or random, so the expected number of correct guesses would be less than or equal to \(.5*n=4\). The alternative is that she can taste the difference, so the alternative hypothesis is that the expected number of correct guesses would be greater than 4.
Here, you can see the critical region of the distribution is one-sided so that it is strictly greater than the t-value associated with .05 in the right tail.
Fisher’s hypothesis is directional; if the lady correctly guesses 7 or more cups of tea, we reject the null hypothesis. We do not accept the alternative; we have only shown that the null is unlikely. The p-value (.03) tells us the chances of her guessing 7 or more cups correctly if the null were true (that she can’t tell the difference and is guessing randomly).
A coefficient is never “significant but in the wrong direction.” The one-tailed figure above makes it clear that \(\beta_k\) is significant iff its t-statistic is in the right tail of the distribution, not in the left. So a t-statistic of -5 would be way down in the left tail, but would not allow us to reject the null hypothesis.
Statistical Errors
Two types of inferential errors are possible in this hypothesis testing structure. A Type I error is the rejection of a true null hypothesis. It is a false positive, so a fire alarm when there’s no fire.
The probability of a Type I error is \(\alpha\), the significance level. A Type II error is the failure to reject a false null hypothesis. It is a false negative, so a silent fire alarm when there is a fire. The probability of a Type II error is denoted \(\beta\). The power of the test is \(1 - \beta\); this is the probability of correctly rejecting a false null hypothesis. The power of a fire alarm is its probability of going off when there is a fire - that is, rejecting the false null that there is no fire.
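To make these error rates concrete with the tea experiment’s decision rule (reject when 7 or more of 8 guesses are correct), here is a short Python sketch; the 0.8 success probability under the alternative is an invented value for illustration, not part of Fisher’s experiment.

```python
from math import comb

def binom_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

# Type I error rate: probability of rejecting (7+ correct) when the null
# (random guessing, p = 0.5) is true.
alpha = binom_tail(7, 8, 0.5)

# Power: probability of rejecting under a hypothetical alternative in which
# the lady identifies each cup correctly 80% of the time (an assumed value).
power = binom_tail(7, 8, 0.8)
beta = 1 - power  # Type II error rate under that alternative

print(round(alpha, 4), round(power, 4), round(beta, 4))
```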
It’s worth noting Kennedy (2002) discusses the Type III error:
A type III error, introduced in Kimball (1957), occurs when a researcher produces the right answer to the wrong question. A corollary of this rule, as noted by Chatfield (1995, p. 9), is that an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.
This is from Peter Kennedy’s very good paper “Sinning in the Basement.”
| Statistical Decision | H₀ is True | H₀ is False |
|---|---|---|
| Fail to Reject H₀ | Correct Decision (1 - α) | Type II Error (β) |
| Reject H₀ | Type I Error (α) | Correct Decision (1 - β, Power) |
Error Types
- Type I Error (α): Rejecting a true null hypothesis (false positive)
  - Example: Concluding a treatment has an effect when it actually doesn’t
  - Probability = α (significance level, typically 0.05)
- Type II Error (β): Failing to reject a false null hypothesis (false negative)
  - Example: Concluding a treatment has no effect when it actually does
  - Probability = β
- Power of the test = 1 - β (ability to detect an effect when it exists)
Hypothesis testing in the regression context
How does this map onto testing hypotheses in regressions?
- The null hypothesis is that the coefficient is zero, \(H_0: \beta_k = 0\).
- The alternative hypothesis is that the coefficient is not zero, \(H_1: \beta_k \neq 0\).
- The test statistic is the t-statistic, \(t = \frac{\widehat{\beta_k}}{SE(\widehat{\beta_k})}\).
- The p-value is the probability of observing a test statistic at least as extreme as the one observed, given the null hypothesis is true.
- The critical value is the value of the test statistic that would lead us to reject the null hypothesis.
- The significance level is the probability of a Type I error, \(\alpha\).
- The power of the test is the probability of correctly rejecting a false null hypothesis, \(1 - \beta\).
You should note the star of the show, the test statistic, comes from our calculation of \(\widehat{\beta_k}\) and \(SE(\widehat{\beta_k})\).
Uncertainty about \(\widehat{\beta}\)
We’ve dug pretty extensively into how we produce OLS estimates of \(\widehat{\beta}\). Now we turn to the question of uncertainty about those estimates. After all, the data are a sample; the variables are imperfect measures; the variables may contain errors; we have ideas about the data generating process, but we don’t know the true model, so the model is certainly misspecified.
We need ways to express our uncertainty about the estimates of \(\beta\), and about our confidence in the claims we make from the model.
If we simply assume \(\widehat{\beta} = \beta\), then we are assuming:
- all the regression assumptions are met;
- the sample is exactly equivalent to or representative of the population;
- the model is specified correctly, and there are no sources of measurement error.
The probability these are all true is slim, but we are uncertain about the extent to which we do or do not meet these requirements. We need some statistical tools for quantifying our uncertainty about \(\widehat{\beta}\).
Measuring Uncertainty
Two parts to this.
The residuals and \(X\) variables are uncorrelated. Recall our simulations of what happens if we violate this assumption. The \(\widehat{\beta}\)s are biased, and the standard errors are inefficient.
The residuals are normally distributed. This is not a critical assumption, but it is a convenient one. It allows us to make inferences about the \(\widehat{\beta}\)s.
Normality of \(\widehat{\beta}\)
Normality of \(\widehat{\beta}\) is facilitated by the Central Limit Theorem. \(\widehat{\beta}\) is our estimate of \(\beta\); it’s one draw of theoretically infinite draws from the sampling distribution of the estimator. Given a sufficiently large sample, the Central Limit Theorem tells us that this sampling distribution is normal.
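A small simulation can illustrate this: repeatedly draw samples from a known model, estimate the slope in each, and look at the resulting sampling distribution. This Python sketch (standard library only; the model and parameter values are made up for the illustration) shows the estimates centering on the true \(\beta\), with roughly 95% of them within two standard deviations of the center.

```python
import random
import statistics

random.seed(42)

def ols_slope(x, y):
    """Slope estimate b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

# Repeated draws from y = 1 + 2x + e; each sample yields one estimate of beta.
true_beta = 2.0
slopes = []
for _ in range(2000):
    x = [random.gauss(0, 1) for _ in range(100)]
    y = [1 + true_beta * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols_slope(x, y))

# The sampling distribution centers on the true beta, and roughly 95% of
# estimates fall within 2 standard deviations of the center.
m, s = statistics.fmean(slopes), statistics.stdev(slopes)
share_within_2sd = sum(abs(b - m) <= 2 * s for b in slopes) / len(slopes)
print(round(m, 2), round(share_within_2sd, 2))
```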
Framework for Inference
- establish a null hypothesis, e.g. \(\beta_1=0\).
- estimate \(\widehat{\beta_1}\)
- estimate the error variance \(\widehat{\sigma^2}\).
- determine the distribution of \(\widehat{\beta_k}\).
- compute a test statistic for \(\widehat{\beta_1}\).
- compare that test statistic to critical values on the distribution of \(\widehat{\beta_k}\).
- determine the probability of observing the test statistic value, given the distribution of \(\widehat{\beta_k}\)
Measuring Uncertainty
- Standard errors are basic measures of uncertainty.
- The standard errors of the estimates of \(\widehat{\beta_k}\) are the standard deviations of the sampling distribution of the estimator.
- Standard errors are analogous to standard deviations surrounding the estimates.
Repetition: Variance-Covariance of \(\epsilon\)
Because we assume constant variance and no serial correlation, we know
\[ E(\mathbf{\epsilon\epsilon'})= \left[ \begin{array}{cccc} \sigma^{2} & 0 &\cdots &0 \\ 0&\sigma^{2} &\cdots &0 \\ \vdots&\vdots&\ddots& \vdots \\ 0 &0 &\cdots & \sigma^{2} \\ \end{array} \right] = \sigma^{2} \left[ \begin{array}{cccc} 1 & 0 &\cdots &0 \\ 0&1 &\cdots &0 \\ \vdots&\vdots&\ddots& \vdots \\ 0 &0 &\cdots & 1 \\ \end{array} \right] = \sigma^{2} \mathbf{I} \]
This is the variance-covariance matrix of the disturbances (\(var\text{-}cov(\epsilon)\)); it is symmetric, with the main diagonal containing the variances of \(\epsilon_i\). Assuming \(Var(\epsilon|X)= \sigma^2\), the average of the main diagonal is \(\sigma^2\); the average of a constant is the constant.
Variance-Covariance Matrix
Start with \(\hat{\beta} = (X'X)^{-1}X'y\) and substitute \(y = X\beta + \epsilon\):
\[\hat{\beta} = (X'X)^{-1}X'(X\beta + \epsilon)\] \[= \beta + (X'X)^{-1}X'\epsilon\]
Therefore: \[\hat{\beta} - \beta = (X'X)^{-1}X'\epsilon\]
The variance-covariance matrix is: \[Var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']\] \[= E[(X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1}]\]
Under homoskedasticity (\(E[\epsilon\epsilon'] = \sigma^2I\)): \[Var(\hat{\beta}) = \sigma^2(X'X)^{-1}\]
Recall that we estimate the error variance as \(\widehat{\sigma}^2 = \widehat{\epsilon}'\widehat{\epsilon} / (N - k)\), where \(\widehat{\epsilon}\) are the residuals, \(N\) is the number of observations, and \(k\) is the number of regressors including the constant.
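As a numerical check on these formulas, here is a Python sketch (made-up data, standard library only) that computes \(\widehat{\beta} = (X'X)^{-1}X'y\) and \(Var(\widehat{\beta}) = \widehat{\sigma}^2(X'X)^{-1}\) for a simple regression with an intercept, and confirms the matrix-based standard error of the slope matches the scalar formula \(\widehat{\sigma}/\sqrt{\sum(x_i-\bar{x})^2}\).

```python
import math

# Tiny illustrative dataset (invented for the sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n, k = len(x), 2  # k counts the constant and the slope

# Build X'X and X'y for X = [1, x].
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
det = n * sxx - sx * sx                  # determinant of X'X
xtx_inv = [[sxx / det, -sx / det],
           [-sx / det,  n / det]]        # (X'X)^{-1} for the 2x2 case

# beta-hat = (X'X)^{-1} X'y
b0 = xtx_inv[0][0] * sy + xtx_inv[0][1] * sxy
b1 = xtx_inv[1][0] * sy + xtx_inv[1][1] * sxy

# sigma2-hat = e'e / (n - k), then Var(beta-hat) = sigma2-hat * (X'X)^{-1}
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e * e for e in resid) / (n - k)
se_b1_matrix = math.sqrt(sigma2 * xtx_inv[1][1])

# Scalar check: se(b1) = sigma-hat / sqrt(sum((x - xbar)^2))
xbar = sx / n
se_b1_scalar = math.sqrt(sigma2) / math.sqrt(sum((xi - xbar) ** 2 for xi in x))

print(round(b1, 3), round(se_b1_matrix, 4))
```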
Variance-Covariance of \(\widehat{\beta}\)
\[ E[(\widehat{\beta}-\beta)(\widehat{\beta}-\beta)'] = \]
\[\left[ \begin{array}{cccc} var(\beta_1) & cov(\beta_1,\beta_2) &\cdots &cov(\beta_1,\beta_k)\\ cov(\beta_2,\beta_1)& var(\beta_2) &\cdots &cov(\beta_2,\beta_k)\\ \vdots&\vdots&\ddots& \vdots\\ cov(\beta_k,\beta_1) &cov(\beta_k,\beta_2) &\cdots & var(\beta_k)\\ \end{array} \right] \]
Standard Errors of \(\beta_k\)
\[ \left[ \begin{array}{cccc} \sqrt{var(\beta_1)} & cov(\beta_1,\beta_2) &\cdots &cov(\beta_1,\beta_k)\\ cov(\beta_2,\beta_1)& \sqrt{var(\beta_2)} &\cdots &cov(\beta_2,\beta_k)\\ \vdots&\vdots&\ddots& \vdots\\ cov(\beta_k,\beta_1) &cov(\beta_k,\beta_2) &\cdots & \sqrt{var(\beta_k)}\\ \end{array} \right] \]
Assume
\[ \widehat{u} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \]
If we meet the GM assumptions, then:
\[ \widehat{\beta} \sim \mathcal{N}(\beta, \widehat{var(\beta)}) \]
\[\widehat{var(\beta)} = \widehat{\sigma^{2}} \mathbf{(X'X)^{-1}} \]
The estimated error variance is spread over the variation in the \(X\) variables, so as variation in \(X\) grows, the uncertainty surrounding \(\mathbf{\widehat{\beta_k}}\) will get smaller.
Let’s look at this in some detail.
Elements of the Variance-Covariance Matrix
Start with a simple regression model:
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i\]
where:
- \(\beta_0 = 1\) (intercept)
- \(\beta_1 = 2\) (slope for \(x_1\))
- \(\beta_2 = 3\) (slope for \(x_2\))
- \(\epsilon_i \sim N(0, 1)\) (random errors)
Thinking about \(\mathbf{X'X}\)
The \(\mathbf{X'X}\) matrix plays a central role in both coefficient estimates and their standard errors. Recall:
- Coefficient Estimates: \(\hat{\beta} = \mathbf{(X'X)^{-1}}\mathbf{X'y}\)
- Variance-Covariance Matrix: \(Var(\hat{\beta}) = \sigma^2\mathbf{(X'X)^{-1}}\)
- Standard Errors: \(SE(\hat{\beta}_j) = \sqrt{\sigma^2[\mathbf{(X'X)^{-1}}]_{jj}}\)
You can see that \(\mathbf{X'X}\) or its inverse directly affects:
- How precisely we can estimate each coefficient
- How much correlation between predictors affects our estimates
- The magnitude of our standard errors
We’re going to simulate how changes in sample size and in the correlation between \(x_1\) and \(x_2\) affect the structure of \(\mathbf{X'X}\) and the precision of our estimates.
Understanding the \(\mathbf{(X'X)}\) Matrix
The \(\mathbf{(X'X)}\) matrix, often called the information matrix, contains crucial information about our ability to estimate regression coefficients. Each element tells us something specific about estimation precision:
What matrix elements mean
- Diagonal Elements (i=j)
  - \(\mathbf{X'X}_{11}\): Sum of squared ones (equals n)
    - Controls intercept precision
    - Larger n means better intercept estimation
  - \(\mathbf{X'X}_{22}\): Sum of squared x₁ values
    - Determines β₁ estimation precision
    - Affected by both sample size and x₁ variance
  - \(\mathbf{X'X}_{33}\): Sum of squared x₂ values
    - Determines β₂ estimation precision
    - Affected by both sample size and x₂ variance
- Off-Diagonal Elements (i≠j)
  - \(\mathbf{X'X}_{12}\) and \(\mathbf{X'X}_{13}\): Sums of x₁ and x₂
    - Related to predictor centering
    - Smaller values improve estimation
  - \(\mathbf{X'X}_{23}\): Sum of x₁x₂ products
    - Key indicator of multicollinearity
    - Larger values relative to diagonals indicate problems
Let’s generate three cases to compare these matrix properties:
- Good Case: Large sample size (1200), no correlation between predictors
- Medium Case: Medium sample size (50), moderate correlation (~.5) between predictors
- Bad Case: Small sample size (10), high correlation (~.9) between predictors
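The comparison across these three cases can be sketched in a few lines. This Python version (standard library only; the seed, replication count, and scaling are choices made for the sketch) computes just the determinant of \(\mathbf{X'X}\), scaled so that cases with different sample sizes are comparable.

```python
import random

random.seed(7)

def xtx_det(n, rho):
    """Draw x1, x2 with correlation ~rho, build X'X for X = [1, x1, x2],
    and return its determinant (cofactor expansion of the symmetric 3x3)."""
    x1 = [random.gauss(0, 1) for _ in range(n)]
    x2 = [rho * a + (1 - rho**2) ** 0.5 * random.gauss(0, 1) for a in x1]
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    m = [[float(n), sum(x1),     sum(x2)],
         [sum(x1),  dot(x1, x1), dot(x1, x2)],
         [sum(x2),  dot(x1, x2), dot(x2, x2)]]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] ** 2)
            - m[0][1] * (m[0][1] * m[2][2] - m[1][2] * m[0][2])
            + m[0][2] * (m[0][1] * m[1][2] - m[1][1] * m[0][2]))

def avg_scaled_det(n, rho, reps=200):
    """Average det(X'X) / n^3 so different sample sizes are comparable."""
    return sum(xtx_det(n, rho) for _ in range(reps)) / reps / n**3

good   = avg_scaled_det(1200, 0.0)   # large n, orthogonal predictors
medium = avg_scaled_det(50, 0.5)     # moderate n and correlation
bad    = avg_scaled_det(10, 0.9)     # small n, near-collinear predictors

# The determinant shrinks toward zero (singularity) as n falls and
# correlation between predictors rises.
print(round(good, 2), round(medium, 2), round(bad, 2))
```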
You can see characteristics of the \(\mathbf{(X'X)}\) matrix under each of the three sets of conditions. The condition number and determinant of the matrix are key indicators of multicollinearity and matrix stability. The condition number is effectively a measure of numerical accuracy - smaller values indicate better matrix properties. The condition number increases as the sample gets smaller and as correlation between predictors increases. The determinant indicates how distinct the columns of the matrix (the variables) are from one another. As the determinant approaches zero, that distinctness declines - at zero, the matrix is singular and cannot be inverted.
These heatmaps provide a visual representation of the \(\mathbf{X'X}\) matrix under the three scenarios described above. The color gradient indicates the relative magnitude of each element, with darker colors representing larger values. Hover over each cell for detailed interpretations of the matrix elements. Here’s how to interpret them:
- Color Intensity:
- Darker shades of green indicate larger values in the matrix.
- Diagonal elements (top-left to bottom-right) represent the information content for each coefficient.
- Off-diagonal elements represent relationships between predictors.
- Diagonal Elements:
- The top-left cell (\(\beta_0, \beta_0\)) shows the sum of squared ones (equal to n).
- The middle cell (\(\beta_1, \beta_1\)) shows the sum of squared x₁ values.
- The bottom-right cell (\(\beta_2, \beta_2\)) shows the sum of squared x₂ values.
- Off-Diagonal Elements:
- The top-middle cell (\(\beta_0, \beta_1\)) shows the sum of x₁ values.
- The top-right cell (\(\beta_0, \beta_2\)) shows the sum of x₂ values.
- The middle-right cell (\(\beta_1, \beta_2\)) shows the sum of x₁x₂ products, which indicates collinearity.
In the Good Case, the main diagonal values are dominant, indicating high information content for all coefficients. Off-diagonal elements are small, suggesting efficient estimation of individual coefficients. In the Medium Case, off-diagonal elements grow larger, indicating some collinearity and reduced precision. In the Bad Case, off-diagonal elements are nearly as large as diagonal elements, indicating severe collinearity and poor estimation precision.
Inverse matrix
Here’s a similar visual analysis of the inverse matrix, \(\mathbf{(X'X)^{-1}}\):
\(\mathbf{(X'X)^{-1}}\) Elements
The \(\mathbf{(X'X)^{-1}}\) matrix is critical for understanding the variance-covariance structure of the regression coefficients. Here’s what each element means:
- Diagonal Elements:
- \(\mathbf{(X'X)^{-1}_{11}}\): Variance of the intercept (\(\beta_0\)) estimate. Larger values indicate higher uncertainty in the intercept.
- \(\mathbf{(X'X)^{-1}_{22}}\): Variance of the \(\beta_1\) estimate. Larger values indicate higher uncertainty in the slope for \(x_1\).
- \(\mathbf{(X'X)^{-1}_{33}}\): Variance of the \(\beta_2\) estimate. Larger values indicate higher uncertainty in the slope for \(x_2\).
- Off-Diagonal Elements:
- \(\mathbf{(X'X)^{-1}_{12}}\): Covariance between \(\beta_0\) and \(\beta_1\). Non-zero values indicate dependence between the intercept and \(x_1\) slope estimates.
- \(\mathbf{(X'X)^{-1}_{13}}\): Covariance between \(\beta_0\) and \(\beta_2\). Non-zero values indicate dependence between the intercept and \(x_2\) slope estimates.
- \(\mathbf{(X'X)^{-1}_{23}}\): Covariance between \(\beta_1\) and \(\beta_2\). Larger values indicate collinearity between \(x_1\) and \(x_2\), leading to inflated standard errors.
For the Good Case (n=1200, ρ≈0):

- The diagonal elements of \(\mathbf{(X'X)^{-1}}\) will be small, indicating precise estimates.
- The off-diagonal elements will be close to zero, indicating minimal dependence between coefficients.
For the Bad Case (n=10, ρ≈0.9):

- The diagonal elements of \(\mathbf{(X'X)^{-1}}\) will be large, indicating high uncertainty in the estimates.
- The off-diagonal elements will be large, indicating strong collinearity and dependence between coefficients.
Coefficient Recovery
Let’s look briefly at how well we recover the coefficients:
Unsurprisingly, we recover the coefficients more precisely under larger samples and lower correlation - note, though, that the estimates remain unbiased in every case.
Characteristics of \(\text{var}(\widehat{\beta_{j}})\)
Inferences about \(\widehat{\beta}\)
The standard error of \(\widehat{\beta_k}\) is the square root of the \(k^{th}\) diagonal element of the variance-covariance matrix of \(\widehat{\beta}\).
In scalar terms,
\[s.e.(\widehat{\beta_{1}}) =\sqrt{\frac{\widehat{\sigma^{2}}}{\sum\limits_{i=1}^{n} (x_{i}-\bar{x})^{2}}} = \frac{\widehat{\sigma}}{\sqrt{\sum\limits_{i=1}^{n} (x_{i}-\bar{x})^{2}}} \]
and the test statistic is given by

\[z=\frac{\widehat{\beta_{j}}-\beta_{j}}{\frac{\widehat{\sigma}}{\sqrt{\sum\limits_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}\]

or equivalently

\[t= \frac{(\widehat{\beta_{j}}-\beta_{j})\sqrt{\sum\limits_{i=1}^{n} (x_{i}-\bar{x})^{2}}}{\widehat{\sigma}} \]
This is not Normal, even though the numerator is - the denominator is built from the squared residuals, each of which is a \(\chi^{2}\) variable. A standard normal variable divided by the square root of a \(\chi^{2}\) variable over its degrees of freedom is distributed \(t\) with \(n-k-1\) degrees of freedom.
Confidence Intervals
The estimates of \(\widehat{\beta_{j}}\) are drawn from a normal distribution, and we know the distribution of the variance, and we know how those distributions are related. For the \(t\) distribution, roughly 95% of the probability mass falls within 2 standard deviations of the mean, so if we construct a 95% confidence interval, we can say that 95% of intervals constructed this way will contain the true value of \(\beta\). Put another way, the interval contains a range of probable values for the true value of \(\beta\).
Because we know (from above) that
\[t= \frac{\widehat{\beta_{j}}-\beta_{j}}{s.e. \widehat{\beta_{j}}}\]
we can easily compute a confidence interval surrounding \(\widehat{\beta_{j}}\) by
\[CI_{c} = \widehat{\beta_{j}} \pm c \cdot {s.e. \widehat{\beta_{j}}}\]
where \(c\) is the critical value that sets the width of the confidence interval; if \(c\) is the 97.5th percentile in the \(t\) distribution with \(6\) degrees of freedom, then the value of \(c\) is 2.447 (see the t-table in the back of the text). Thus, we compute the 95% confidence interval by:

\[CI_{95} = \left[\widehat{\beta_{j}} - 2.447 \cdot {s.e. \widehat{\beta_{j}}},\ \widehat{\beta_{j}} + 2.447 \cdot {s.e. \widehat{\beta_{j}}}\right]\]
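In code, the interval is a one-liner. This Python sketch uses the 2.447 critical value from the text; the estimate and standard error are invented values for illustration.

```python
# CI = b +/- c * se, with c the 97.5th percentile of t with 6 df (2.447,
# from the t-table). The estimate and standard error below are made up.
b_hat, se = 1.50, 0.40
c = 2.447

lower, upper = b_hat - c * se, b_hat + c * se
print(round(lower, 4), round(upper, 4))  # 0.5212 2.4788

# Zero lies outside the interval, so at the .05 level we would reject H0: beta = 0.
contains_zero = lower <= 0 <= upper
print(contains_zero)  # False
```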
Point Estimates
Using confidence intervals, we state the probability the true coefficient lies between the upper and lower bounds of the interval. Point estimate tests allow us to test specific hypotheses about the value of \(\widehat{\beta_{j}}\). Let’s go back to the computation of a \(t\) statistic:
\[t= \frac{\widehat{\beta_{j}}-\beta_{j}}{s.e. \widehat{\beta_{j}}}\]
You’ll notice in the numerator we’re subtracting the true coefficient from our estimate. Of course, we don’t know the true values of \(\beta\), so we choose a value to which we want to compare our estimate. Then we’re generating the probability of drawing a sample with \(\widehat{\beta_{j}}\) given the value we choose.
Hypotheses
Normally, we set up our hypothesis tests such that the value of \(\beta\) we select is zero. Thus, we state null and alternative hypotheses:
\[H_0:\beta_j=0\] \[H_1:\beta_j\ne 0 \]
Then, we choose significance levels, find the critical value associated with that significance level, and compare \(t_{\widehat{\beta_j}}=\widehat{\beta_j}/se(\widehat{\beta_j})\) to the critical value. Finally, we either reject or fail to reject \(H_0\). Note that we could choose other values to which we could compare the probability of \(\widehat{\beta_{j}}\), though selecting those values can be tricky.
Hypotheses
The relationship we expect between \(x\) and \(y\) is the alternative hypothesis.
\(H_a\) is the alternative to the null - the null includes everything that’s not the alternative; mathematically, the null always includes the point of comparison (usually zero).
Non-directional alternative hypotheses simply expect that \(\widehat{\beta}\) will not be zero. These are not very specific, and so less desirable.
Directional hypotheses expect a direction, e.g., \(\widehat{\beta}\) will be positive; the null in this case is \(\widehat{\beta} \leq 0\). This is more specific, more desirable.
If you expect \(\widehat{\beta} > 0\), and your estimated \(\widehat{\beta} = -1.2\) with a p-value of .001, you cannot reject the null. While \(\widehat{\beta}\) is statistically different from zero, it’s not in the direction of your hypothesis, so the result does not support the alternative hypothesis.
Don’t Expect the Null Hypothesis
The classical hypothesis test is always constructed such that the null includes zero. It is not possible to expect, as the alternative hypothesis, that \(\widehat{\beta} = 0\). We should never see the alternative hypothesis:
X will have no relationship to y: \(H_A: \beta_1=0\)
In the classical test, the null includes a specific point (usually zero), though it can contain that point and a range of other values, e.g.:
\[H_0: \beta_1 \leq 0\]
Here, the null that \(\beta_1\) is less than or equal to zero includes zero, but also includes all negative values. The alternative is that \(\beta_1\) is greater than zero, so the alternative is a range of values, not a specific point.
If we flip this around to say “I expect \(\beta_1\) to be zero”, then we are expecting the null hypothesis. Note that the alternative hypothesis is now a specific point, and the null is a range of values without any specific point expectation.
It’s important to note that the null may be around a point other than zero. We might expect the coefficient to be different from 5, so the null is \(H_0: \beta_1=5\). Or, we might expect the coefficient to be less than or equal to 5, so the null is \(H_0: \beta_1 \leq 5\).
The point is we cannot expect in the alternative that \(\beta_1\) is zero where the null is that it is anything except zero unless we build a new type of test. See Gill (1999) for an excellent discussion of this.
Joint Hypothesis Tests
Normally we test hypotheses about specific parameters, but there are cases where we are interested in a hypothesis like this:
\[H_0:\beta_1=\beta_2=0\] \[H_1: \text{at least one of } \beta_1, \beta_2 \neq 0\]
To test something like this (and really, tests like this are something we should be extremely interested in), we use the F-statistic.
The joint hypothesis above has two restrictions - we expect two parameters to be zero in the null. We can have as many as \(k-1\) exclusions or restrictions in the model, and we typically call the exclusions or restrictions \(q\).
Intuition
Decompose the variation in the data and model this way:
\[TSS= MSS+RSS\]
Suppose a regression like this:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]
Our usual, single coefficient hypothesis test, say on \(\beta_3\) is
\[H_0: \beta_3=0\]
which is equivalent to
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + {\mathbf{0}} x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]

To compare the two specifications, call the full model the unrestricted model:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]
\[RSS_U = \sum_{i=1}^N \widehat{u_U}^2\]
and refer to this as the restricted model:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + {\mathbf{0}} x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]
\[RSS_R = \sum_{i=1}^N \widehat{u_R}^2\]
Are these quantities different from one another?
\[RSS_U \neq RSS_R\]
If

\[RSS_U = RSS_R\]

then the two models are indistinguishable, and \(\beta_3\) is not different from zero. Alternatively, if

\[RSS_R - RSS_U\]

is sufficiently large, we can reject the null that the models are the same, and thus the specific null that \(\beta_3=0\). By the way, this single coefficient F-test is inferentially equivalent to the t-test; in fact, \(F_{\beta_{k}}= t^2_{\beta_{k}}\).
Extension
Extend this to the following hypothesis test:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]
\[H_0: \beta_3=\beta_4=\beta_5=0\]
or that the effects of \(x_3, x_4, x_5\) are individually and jointly zero.
Unrestricted:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 +\beta_5 x_5 + u_i \]
Restricted:
\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + {\mathbf{0}} x_3 + {\mathbf{0}} x_4 +{\mathbf{0}} x_5 + u_i \]
F-test
How do we test whether \(RSS_U\) and \(RSS_R\) differ?
\[F=\frac{(RSS_R-RSS_{U})/q}{RSS_{U}/(n-k-1)} \]
where \(q\) is the number of restrictions.
The numerator is the difference in fit between the two models (weighted or “normed” by the different number of parameters, the restrictions), and the denominator is a baseline of how well the full model fits. So this is a ratio of improvement to the fit of the full model. Both are distributed \(\chi^2\); their ratio is distributed \(F\).
Alternatively
We could equivalently state this as:
\[F=\frac{(\sum\widehat{u}^{2}_{R}-\sum\widehat{u}^{2}_{U})/q}{\sum\widehat{u}^{2}_{U}/(n-k-1)} \]
Notice that the denominator is also \(\widehat{\sigma^2}\). The resulting statistic is distributed \(F\sim F_{q,n-k-1}\).
or we can write the same thing in terms of the \(R^2\) estimates of the two models:
\[F=\frac{(R^2_{U}-R^2_{R})/q}{(1-R^2_{U})/(n-k-1)} \]
In practice
For a joint hypothesis test, we estimate two nested models which we label the restricted and unrestricted models.
The unrestricted model has all \(k\) variables; the restricted model has \(k-q\) variables, excluding the variables whose joint effect we want to test.
Estimate the models, and examine how much the RSS increases when we exclude the \(q\) variables from the model - of course, the RSS will always increase when we drop variables, but the question is how much will it increase. The F-test measures the increase in RSS in the restricted model relative to that of the unrestricted model.
A common example
Every OLS model reports a “model F-test” - this compares the model you estimated with all your \(x\) variables (this is the unrestricted model) to the null model - no variables, just a constant. The null hypothesis is
\[H_0: \beta_1=\beta_2= \ldots = \beta_k = 0\]
It’s a test of your model against the model with only a constant - it’s comparing the value of your model against the model where our best guess at what explains \(y\) is \(\beta_0 = \bar{y}\).
Tests like this are important and powerful
Our enterprise is really about comparative model testing. Hypothesis tests on individual coefficients do not allow us to compare models. Our theories imply different explanations for phenomena, and thus different empirical models. To treat the theories comparatively, we must test our models comparatively as well. So tests such as the joint hypothesis test are critically important to that endeavor.
This test is a distributional test; that is, its product is a point or range on a known probability distribution. Thus, we can be quite exact about the probability with which we either do or do not reject the joint null hypothesis.
This is not the case of another popular measure of model fit, the \(R^2\), which has no sampling distribution. The \(R^2\) cannot tell us anything probabilistic about the model or more specifically, about the hypothesis that a variable or variables do or do not significantly affect the fit of the model.
Inference on Predictions
We worry a great deal about our uncertainty surrounding \(\widehat{\beta}\). Until relatively recently, few concerned themselves with uncertainty surrounding \(\widehat{y}\) even though \(\widehat{y}= x\widehat{\beta}\) - you’ll note of course \(\widehat{y}\) is a function of the \(\widehat{\beta}\)s.
Just as we estimate the variance for each \(\widehat{\beta}\), we can estimate the variance for each \(\widehat{y}\).
The variance of \(\widehat{\beta_k}\) is the \(k^{th}\) diagonal element of the matrix \(\sigma^2 \mathbf{(X'X)^{-1}}\). The error variance, \(\sigma^2\), is the sum of squared residuals spread over the degrees of freedom. \(\mathbf{(X'X)^{-1}}\) is the inverse of the cross-product matrix of the \(\mathbf{X}\)s - multiplying by it spreads the error variance over the variation in the \(\mathbf{X}\)s.
The variance of \(\widehat{y}\) is \(\mathbf{X\widehat{V}X'}\), where \(\mathbf{\widehat{V}}=\sigma^2 \mathbf{(X'X)^{-1}}\). This is an NxN matrix measuring the variances and covariances of the \(\widehat{y}\)s. The square root of the \(i^{th}\) diagonal element is the standard error of \(\widehat{y}_i\).
We can use the standard error of \(\widehat{y}_i\) to generate a confidence interval around \(\widehat{y}_i\) in the usual way:
\[CI_{\widehat{y_i}} = \widehat{y}_i \pm c \cdot {s.e. \widehat{y_i}} \]
usually
\[CI_{\widehat{y_i}} = \widehat{y}_i \pm 1.96 \cdot {s.e. \widehat{y_i}} \]
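The computation can be sketched as follows in Python (made-up data, standard library only): the matrix-based standard error of \(\widehat{y}\) at a point, \(\sqrt{x_0'\mathbf{\widehat{V}}x_0}\), is computed and checked against the scalar simple-regression formula \(Var(\widehat{y}_0) = \widehat{\sigma}^2\left(1/n + (x_0-\bar{x})^2/\sum(x_i-\bar{x})^2\right)\), then used to build the usual 1.96 interval.

```python
import math
import statistics

# Made-up data for the sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.8, 4.2, 5.9, 8.1, 9.9]
n = len(x)

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sigma2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

# V-hat = sigma2 * (X'X)^{-1} for X = [1, x] (2x2 inverse written out).
sx, sxx_raw = sum(x), sum(xi * xi for xi in x)
det = n * sxx_raw - sx * sx
V = [[sigma2 * sxx_raw / det, -sigma2 * sx / det],
     [-sigma2 * sx / det,      sigma2 * n / det]]

def se_yhat(x0):
    """sqrt(x0' V x0) with x0 = (1, x0): the standard error of y-hat at x0."""
    return math.sqrt(V[0][0] + 2 * x0 * V[0][1] + x0 * x0 * V[1][1])

# Scalar check for the simple-regression case:
# Var(y-hat at x0) = sigma2 * (1/n + (x0 - xbar)^2 / sum((x - xbar)^2))
x0 = 3.0
se_matrix = se_yhat(x0)
se_scalar = math.sqrt(sigma2 * (1 / n + (x0 - xbar) ** 2 / sxx))

yhat = b0 + b1 * x0
ci = (yhat - 1.96 * se_matrix, yhat + 1.96 * se_matrix)
print(round(se_matrix, 4), round(se_scalar, 4))
```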